NOTE: the example data here is an uncompressed fastq file. In general, you should ALWAYS store fastq files as gzipped compressed (.fastq.gz). We use the compressed one here to show the contrast in file size after zipping it, and to let us easily use simple tools like grep and head.
mkdir ~/ngs
cd ~/ngs
ln -s /workshops/ric_workshop/common/3_NGS/test1.fastq
head test1.fastq
@HWI-D00758:53:C7MV4ANXX:7:1101:1203:2160 1:N:0:CAGGAC
CACTCTACCGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAAAAAAAACTTTT
+
</B//F<F/<<BBFFBBFFB//<FFFF//F<7/BBBBFFF<7<B//<7/F///</<B///BB<FF/BB<BBF/BFB<7B/7B##################
@HWI-D00758:53:C7MV4ANXX:7:1101:1337:2249 1:N:0:CAGGAC
GACTATGATGCCCTAGATGTTGCCAACAAGATTGGGATCATCTAAACTGAGTCCAGATGGCTAATTCTAAATATATACTTTTTTCACCATAAAAAAAAAA
+
BBBBBFF<FFFFFFFF<FFFFFFF/<FFFF/FFFFFBFF/F<FFFFFFFFFFFFFF/FFFBFBBFF/FFFFFFFFFFFFFFFFFFF//<FFFFFFFFFFF
@HWI-D00758:53:C7MV4ANXX:7:1101:1698:2127 1:N:0:CAGGAC
TGGGCTACATTTTCTTATAAAAGAACATTACTATACCCTTTATGAAACTAAAGGACTAAGGAGGATTTAGTAGTAAATTAAGAATAGAGAGCTTAATTGA
wc -l test1.fastq
100000 test1.fastq
ls -lhL test1.fastq
-rw-rw-r--. 1 gchlip2 rrc_shared 6.3M May 20 09:58 test1.fastq
gzip -c test1.fastq > test1.fastq.gz
ls -lh test1.fastq.gz
-rw-rw-r-- 1 gchlip2 gchlip2 1.2M Jun 20 14:21 test1.fastq.gz
cp ~/slurm_template.sh run_fastqc.sh
nano run_fastqc.sh
To look like the following. You can also copy from
/workshops/ric_workshop/common/3_NGS/run_fastqc.sh
#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load FastQC
fastqc test1.fastq.gz
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch run_fastqc.sh
test1_fastqc.html) and view in a browser, i.e. double
click the file.For this exercise, we will run FastQC on a set of FASTQ files and then use MultiQC to summarize the FastQC reports.
ln -s /workshops/ric_workshop/common/3_NGS/fastq_files
cp ~/slurm_template.sh run_multiqc.sh
nano run_multiqc.sh) the SLURM script to look
like the following. You can also copy from
/workshops/ric_workshop/common/3_NGS/run_multiqc.sh#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load FastQC
mkdir fastqc_out
fastqc -o fastqc_out fastq_files/*.fastq.gz
module load MultiQC
multiqc fastqc_out
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch run_multiqc.sh
Once the job is down, download the
multiqc_report.html
Open the multiqc_report.html in your browser (Double
click on the file)
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
ln -s /workshops/ric_workshop/common/3_NGS/truseq.fa
cat truseq.fa
>TruSeq_R1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>TruSeq_R2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
# zcat -f will "cat" both compressed and uncompressed text files
zcat -f test1.fastq.gz | grep --color AGATCGGAA
cp ~/slurm_template.sh trim.sh
nano trim.sh) the SLURM script to look like the
following. This SLURM script will perform the following.
--ignore parameter to command to have it ignore the
previous set of FastQC reports that we generated for the multiple files.
By default, MultiQC will scan the directory and all subdirectories of
the path given. We will also specify a different file name.You can also copy from
/workshops/ric_workshop/common/3_NGS/trim.sh
#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load cutadapt
cutadapt -a file:truseq.fa -q 30 -o test1_trimmed.fastq.gz test1.fastq.gz > test1_cutadapt.log
module load FastQC
fastqc test1_trimmed.fastq.gz
module load MultiQC
multiqc --ignore fastqc_out --filename trimmed_multiqc .
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch trim.sh
test1_cutadapt.log)This is cutadapt 4.4 with Python 3.10.8
Command line parameters: -a file:truseq.fa -q 30 -o test1_trimmed.fastq.gz test1.fastq.gz
Processing single-end reads on 1 core ...
Finished in 0.325 s (13.011 µs/read; 4.61 M reads/minute).
=== Summary ===
Total reads processed: 25,000
Reads with adapters: 1,689 (6.8%)
Reads written (passing filters): 25,000 (100.0%)
Total basepairs processed: 2,500,000 bp
Quality-trimmed: 335,454 bp (13.4%)
Total written (filtered): 2,125,995 bp (85.0%)
=== Adapter TruSeq_R1 ===
Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 1689 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3
Bases preceding removed adapters:
A: 94.1%
C: 2.3%
G: 1.4%
T: 2.0%
none/other: 0.2%
WARNING:
The adapter is preceded by 'A' extremely often.
The provided adapter sequence could be incomplete at its 5' end.
Ignore this warning when trimming primers.
Overview of removed sequences
length count expect max.err error counts
3 183 390.6 0 183
4 98 97.7 0 98
5 40 24.4 0 40
...
73 6 0.0 3 2 1 3
74 17 0.0 3 5 7 4 1
75 1 0.0 3 1
=== Adapter TruSeq_R2 ===
Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 33; Trimmed: 0 times
WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.
test1_fastqc.html,
test1_trimmed_fastqc.html and
trimmed_multiqc.html files using SFTP. Once downloaded,
open the reports – they should open in a web browser.Raw data FastQC
Trimmed data FastQC
SIDEBAR
By default, the base unit for read counts in MultiQC is million (M).
If you want to change to a different base unit, create a configuration
file, e.g. my_report.conf, with settings for
read_count_multiplier, read_count_prefix and
read_count_desc. The following is an example configuration
file that would set the base unit to thousands (k)
read_count_multiplier: 0.001
read_count_prefix: "k"
read_count_desc: "thousands"
Once you have created the configuration file, use the -c
option of MultiQC to have it use the configuration file.
multiqc -c my_report.conf .
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
ln -s /workshops/ric_workshop/common/3_NGS/mm39
ls -lhL mm39
total 8.1G
drwxrwsr-x 4 gchlip2 rrc_shared 512 Jun 10 20:06 Bisulfite_Genome
-rw-rw-r-- 1 gchlip2 rrc_shared 2.6G May 29 14:09 mm39.fa
-rw-rw-r-- 1 gchlip2 rrc_shared 5.8K May 29 17:00 mm39.fa.amb
-rw-rw-r-- 1 gchlip2 rrc_shared 5.1K May 29 17:00 mm39.fa.ann
-rw-rw-r-- 1 gchlip2 rrc_shared 2.6G May 29 17:00 mm39.fa.bwt
-rw-rw-r-- 1 gchlip2 rrc_shared 2.0K May 29 16:33 mm39.fa.fai
-rw-rw-r-- 1 gchlip2 rrc_shared 651M May 29 17:00 mm39.fa.pac
-rw-rw-r-- 1 gchlip2 rrc_shared 1.3G May 29 17:10 mm39.fa.sa
-rw-rw-r-- 1 gchlip2 rrc_shared 1.1G May 29 16:33 mm39.gtf
ln -s /workshops/ric_workshop/common/3_NGS/atac_subset_R1.fastq.gz
ln -s /workshops/ric_workshop/common/3_NGS/atac_subset_R2.fastq.gz
(exercise continues on next page)
cp ~/slurm_template.sh mapping.sh
NOTE: you can add your fastq file names to the script ahead of time like this:
ls atac_subset*.fastq.gz >> mapping.sh
nano mapping.sh) the SLURM script look like this
(bwa mem and samtools commands should be on the same
line).
#SBATCH --mem=10G). bwa-mem requires
sufficient memory to load the genome index.#SBATCH --ntasks-per-node=4). bwa-mem can run
some processes in parallel and can take advantage of multiple processing
cores to run faster.You can also copy from
/workshops/ric_workshop/common/3_NGS/mapping.sh
#!/bin/sh
#SBATCH --mem=10G
#SBATCH --ntasks-per-node=4
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load BWA
module load SAMtools
bwa mem -t $SLURM_TASKS_PER_NODE mm39/mm39.fa atac_subset_R1.fastq.gz atac_subset_R2.fastq.gz |
samtools view -b - > atac_subset.bam
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch mapping.sh
If your last step (genome alignment) did not finish yet, you can link to the bam file below:
ln -s /workshops/ric_workshop/common/3_NGS/atac_subset.bam
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
module load SAMtools
samtools view atac_subset.bam | head
NB501189:35:HTJL2BGXY:1:11101:12775:1048 121 19 37684741 60 41M = 37684741 0 GGAGACGTTTAGGAACCACCGCAGGAGACCACATANTGAAC EEEEEEA/AEEEEEEEEEEEEEAEEEEEEEE/EEE#AAAAA NM:i:1 MD:Z:35T5 MQ:i:0 AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12775:1048 181 19 37684741 0 * = 37684741 0 TNCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGCCGGACN E#EE############################EEEEAAAA# MC:Z:41M MQ:i:60 AS:i:0 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12274:1048 121 1 98511468 0 41M = 98511468 0 GATGAAAGGATGGACCATCTTGAGACTGCCATATCNAGGGA EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA NM:i:1 MD:Z:35C5 MQ:i:0 AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:12274:1048 181 1 98511468 0 * = 98511468 0 TNATNNGNNNNNNNNNNNNNNNNNNNNNNNNNGGGTATTCN E#EE##E#########################EEEEAAAA# MC:Z:41M MQ:i:0 AS:i:0 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12088:1048 73 11 102174317 60 41M = 102174317 0 GAAAGNAAGAAAGAAAAATGATTAAGGAGGGGCCAGTGAGA AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:5A35 MQ:i:0 AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:12088:1048 133 11 102174317 0 * = 102174317 0 NCCCATACCNNNNNNNNNNNNNNNNNNNNNNNNNTNNTCNT #AAAAEEEE#########################E##EA#E MC:Z:41M MQ:i:60 AS:i:0 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:2303:1049 121 2 35873055 60 41M = 35873055 0 GAGGATGTACTCAGTAACTATTGTATGTATGTATGNATGTA AEEEEAEAEE<EAEEE6EE6EEEEE/EEEE/EEE/#A/AAA NM:i:1 MD:Z:35T5 MQ:i:0 AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:2303:1049 181 2 35873055 0 * = 35873055 0 ANAGCNCNNNNNNNNNNNNNNNNNNNNNNNNNACCCTGACN E#/EE#E#########################A/EEAAAA# MC:Z:41M MQ:i:60 AS:i:0 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:16869:1049 73 3 39342697 0 41M = 39342697 0 CCCATNAACATACAAGAAGCATACAGAACTCCAAATAGACT AAAAA#EEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEE NM:i:1 MD:Z:5G35 MQ:i:0 AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:16869:1049 133 3 39342697 0 * = 39342697 0 NTATAGTAGNNNNNNNNNNNNNNNNNNNNNNNNNANTGTNT #AAAAEEEE#########################E#/EE#E MC:Z:41M MQ:i:0 AS:i:0 XS:i:0
samtools view -h atac_subset.bam | head
@HD VN:1.5 SO:unsorted GO:query
@SQ SN:1 LN:195154279
@SQ SN:2 LN:181755017
@SQ SN:3 LN:159745316
@SQ SN:4 LN:156860686
@SQ SN:5 LN:151758149
@SQ SN:6 LN:149588044
@SQ SN:7 LN:144995196
@SQ SN:8 LN:130127694
@SQ SN:9 LN:124359700
NOTE: for a real data set, this should be run as a SLURM job.
samtools view -h atac_subset.bam > atac_subset.sam
ls -lhL atac_subset.bam
ls -lhL atac_subset.sam
-rw-rw-r-- 1 gchlip2 rrc_shared 9.1M May 30 15:39 atac_subset.bam
-rw-rw-r-- 1 gchlip2 gchlip2 41M Jun 20 14:21 atac_subset.sam
If you need to convert a SAM file to BAM file, just use the
-b flag.
samtools view -b atac_subset.sam > atac_subset_converted.bam
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
module load SAMtools
samtools flagstat atac_subset.bam
200000 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197766 + 0 mapped (98.88% : N/A)
197766 + 0 primary mapped (98.88% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
195340 + 0 properly paired (97.67% : N/A)
196560 + 0 with itself and mate mapped
1206 + 0 singletons (0.60% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...
Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:
0x1 1 PAIRED paired-end / multiple-segment sequencing technology
0x2 2 PROPER_PAIR each segment properly aligned according to aligner
0x4 4 UNMAP segment unmapped
0x8 8 MUNMAP next segment in the template unmapped
0x10 16 REVERSE SEQ is reverse complemented
0x20 32 MREVERSE SEQ of next segment in template is rev.complemented
0x40 64 READ1 the first segment in the template
0x80 128 READ2 the last segment in the template
0x100 256 SECONDARY secondary alignment
0x200 512 QCFAIL not passing quality controls or other filters
0x400 1024 DUP PCR or optical duplicate
0x800 2048 SUPPLEMENTARY supplementary alignment
samtools flags 4
0x4 4 UNMAP
samtools flags UNMAP,PAIRED
0x5 5 PAIRED,UNMAP
samtools view -F 4 atac_subset.bam | head
NB501189:35:HTJL2BGXY:1:11101:12775:1048 121 19 37684741 60 41M = 37684741 0 GGAGACGTTTAGGAACCACCGCAGGAGACCACATANTGAAC EEEEEEA/AEEEEEEEEEEEEEAEEEEEEEE/EEE#AAAAA NM:i:1 MD:Z:35T5 MQ:i:0 AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12274:1048 121 1 98511468 0 41M = 98511468 0 GATGAAAGGATGGACCATCTTGAGACTGCCATATCNAGGGA EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA NM:i:1 MD:Z:35C5 MQ:i:0 AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:12088:1048 73 11 102174317 60 41M = 102174317 0 GAAAGNAAGAAAGAAAAATGATTAAGGAGGGGCCAGTGAGA AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:5A35 MQ:i:0 AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:2303:1049 121 2 35873055 60 41M = 35873055 0 GAGGATGTACTCAGTAACTATTGTATGTATGTATGNATGTA AEEEEAEAEE<EAEEE6EE6EEEEE/EEEE/EEE/#A/AAA NM:i:1 MD:Z:35T5 MQ:i:0 AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:16869:1049 73 3 39342697 0 41M = 39342697 0 CCCATNAACATACAAGAAGCATACAGAACTCCAAATAGACT AAAAA#EEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEE NM:i:1 MD:Z:5G35 MQ:i:0 AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:4079:1049 73 13 101638401 60 41M = 101638401 0 GTTTCNGGCAGCCCTGTCTCCTGCCAAACTTGGGCACTTGG AAAAA#EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:5A35 MQ:i:0 AS:i:39 XS:i:19
NB501189:35:HTJL2BGXY:1:11101:9468:1049 121 2 92996906 60 41M = 92996906 0 TGCCCGCTCCCTTCCCTCGGCTACTCTCTGCCTTTNTTGGA /EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA NM:i:1 MD:Z:35G5 MQ:i:0 AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:2945:1049 73 7 79852260 60 41M = 79852260 0 GTTGGNGGCGCCTGCTCTGCCCAGGTCCCAGGGTCGATGGG AAAAA#6EEEEEEEEEAEEEEEEAEAEEAEEEEEAEEEEEE NM:i:1 MD:Z:5T35 MQ:i:0 AS:i:39 XS:i:19
NB501189:35:HTJL2BGXY:1:11101:17448:1049 73 X 12466241 46 41M = 12466241 0 GGGCTNCACAGAGTAAACCTTGTCTCGAAAAACCAAAAAGA AAAAA#EEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:5A35 MQ:i:0 AS:i:39 XS:i:31
NB501189:35:HTJL2BGXY:1:11101:8077:1049 73 1 183573501 60 41M = 183573501 0 GAGCANGAGTCTTCAATTCACGTTGTTTACAGATCATGGGA AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:1 MD:Z:5T35 MQ:i:0 AS:i:39 XS:i:0
We can use samtools view with a filter to get unmapped
reads from a BAM file, in case we want to investigate those sequences
more. The -b option tells samtools to output as BAM (i.e.,
compressed), which is better if we’re saving into a new file.
NOTE: for a real data set, these should be run as a SLURM job.
samtools view -f 4 -b atac_subset.bam > atac_subset_unmapped.bam
samtools flagstat atac_subset_unmapped.bam
2234 + 0 in total (QC-passed reads + QC-failed reads)
2234 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
2234 + 0 paired in sequencing
698 + 0 read1
1536 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
We could also do this in one command using a pipe, and without making an intermediate file. The solitary dash (-) in the last command tells samtools to read the alignment data from STDIN rather than a specific file.
NOTE if piping from samtools view
include either the -h (include headers) or -b
(BAM output) in the view command. When reading from STDIN, the
uncompressed output tends to be a bit faster.
samtools view -f 4 -h atac_subset.bam | samtools flagstat -
2234 + 0 in total (QC-passed reads + QC-failed reads)
2234 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
2234 + 0 paired in sequencing
698 + 0 read1
1536 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
In this example, we are going to look for reads in which both pairs
are not mapped (fully unmapped). For these examples, we are just going
to pipe the output to samtools flagstat to see the count
statistics. However, one could save the output to a BAM file for further
analyses.
Also, there are other options for filtering. For more details you can
check the usage statment samtools view --help or check the
documentation at https://www.htslib.org/doc/samtools-view.html.
samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...
Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:
0x1 1 PAIRED paired-end / multiple-segment sequencing technology
0x2 2 PROPER_PAIR each segment properly aligned according to aligner
0x4 4 UNMAP segment unmapped
0x8 8 MUNMAP next segment in the template unmapped
0x10 16 REVERSE SEQ is reverse complemented
0x20 32 MREVERSE SEQ of next segment in template is rev.complemented
0x40 64 READ1 the first segment in the template
0x80 128 READ2 the last segment in the template
0x100 256 SECONDARY secondary alignment
0x200 512 QCFAIL not passing quality controls or other filters
0x400 1024 DUP PCR or optical duplicate
0x800 2048 SUPPLEMENTARY supplementary alignment
samtools flags UNMAP,MUNMAP
0xc 12 UNMAP,MUNMAP
samtools view -f 12 -h atac_subset.bam | samtools flagstat -
1028 + 0 in total (QC-passed reads + QC-failed reads)
1028 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1028 + 0 paired in sequencing
514 + 0 read1
514 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view -f UNMAP,MUNMAP -h atac_subset.bam | samtools flagstat -
1028 + 0 in total (QC-passed reads + QC-failed reads)
1028 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1028 + 0 paired in sequencing
514 + 0 read1
514 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Here is a final example of filtering for reads that are unmapped but,
the mate is mapped. In this case, the -f option will
require reads to have this flag (UNMAP) and the -F option
will exclude any reads with this flag set (MUNMAP)
samtools view -f UNMAP -F MUNMAP -h atac_subset.bam | samtools flagstat -
1206 + 0 in total (QC-passed reads + QC-failed reads)
1206 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1206 + 0 paired in sequencing
184 + 0 read1
1022 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
NOTE: for a real data set, this should be run as a SLURM job.
samtools sort atac_subset.bam > atac_subset_sort.bam
samtools index atac_subset_sort.bam
Now we can filter reads in specific regions of the genome by the genomic coordinates. For example, an arbitrary 100bp region in chr1:
samtools view atac_subset_sort.bam 1:24650681-24650781 | wc -l
63
samtools view atac_subset_sort.bam 1:24650681-24650781 | head
NB501189:35:HTJL2BGXY:1:11101:11619:6108 163 1 24650643 0 41M = 24650675 73 TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:17117:6255 99 1 24650643 0 41M = 24650797 195 TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7288:10786 99 1 24650643 0 41M = 24650658 56 TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA AAAAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:10432:5614 83 1 24650644 0 41M = 24650620 -65 TTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAAT EEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,+10974,41M,0;
NB501189:35:HTJL2BGXY:1:11101:22907:13094 83 1 24650644 0 41M = 24650619 -66 TTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAAT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,+10974,41M,0;
NB501189:35:HTJL2BGXY:1:11101:20582:14664 99 1 24650652 0 41M = 24650674 63 TGTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTC AAAAAEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10966,41M,0;
NB501189:35:HTJL2BGXY:1:11101:26577:4445 99 1 24650653 0 41M = 24650798 186 GTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10965,41M,0;
NB501189:35:HTJL2BGXY:1:11101:24672:10931 99 1 24650657 0 41M = 24650667 51 GGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGG 6AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10961,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7288:10786 147 1 24650658 0 41M = 24650643 -56 GCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGGG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,+10960,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7732:7456 163 1 24650666 0 41M = 24650675 50 AGATTGAGCTAGTTATAATTATTCCTCATAGGGAGAGAAGG AAAAAEEEEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:41 MC:Z:41M MQ:i:0 AS:i:41 XS:i:41 XA:Z:MT,-10952,41M,0;
We can also pipe to flagstat, but need to include the headers for this so that SAM data is properly formatted.
samtools view -h atac_subset_sort.bam 1:24650681-24650781 | samtools flagstat -
63 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
63 + 0 mapped (100.00% : N/A)
63 + 0 primary mapped (100.00% : N/A)
63 + 0 paired in sequencing
33 + 0 read1
30 + 0 read2
63 + 0 properly paired (100.00% : N/A)
63 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
NOTE: for a real data set, this should be run as a SLURM job.
samtools rmdup atac_subset_sort.bam atac_subset_nodups.bam
...
[bam_rmdup_core] processing reference GL456394.1...
[bam_rmdup_core] processing reference GL456396.1...
[bam_rmdup_core] processing reference JH584292.1...
[bam_rmdup_core] processing reference JH584293.1...
[bam_rmdup_core] processing reference JH584294.1...
[bam_rmdup_core] processing reference JH584296.1...
[bam_rmdup_core] processing reference JH584298.1...
[bam_rmdup_core] processing reference JH584299.1...
[bam_rmdup_core] processing reference JH584304.1...
[bam_rmdup_core] 849 / 97918 = 0.0087 in library ' '
samtools flagstat atac_subset_sort.bam
samtools flagstat atac_subset_nodups.bam
200000 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197766 + 0 mapped (98.88% : N/A)
197766 + 0 primary mapped (98.88% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
195340 + 0 properly paired (97.67% : N/A)
196560 + 0 with itself and mate mapped
1206 + 0 singletons (0.60% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)
198323 + 0 in total (QC-passed reads + QC-failed reads)
198323 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
196089 + 0 mapped (98.87% : N/A)
196089 + 0 primary mapped (98.87% : N/A)
198323 + 0 paired in sequencing
99163 + 0 read1
99160 + 0 read2
193665 + 0 properly paired (97.65% : N/A)
194883 + 0 with itself and mate mapped
1206 + 0 singletons (0.61% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
ln -s /workshops/ric_workshop/common/3_NGS/atac_nodups.bam
ln -s /workshops/ric_workshop/common/3_NGS/atac_nodups.bam.bai
cp ~/slurm_template.sh ./peak_calling.sh
nano peak_calling.sh) the SLURM script look like
this Options we are setting:-g mm – this is mouse, so Macs2 knows the approximate
genome size.-B --SPRM – output a bedGraph file of the coverage
(-B), and normalize it to reads per million.
(--SPMR)--nomodel – don’t build a shifting model. This is used
for ChIP-seq to estimate where the protein was bound within a TF
fragment. For ATAC-seq we don’t need or want to do this.--nolambda – lambda is Macs2’s name for the input
control sample. We don’t have one for ATAC-seq, so we note that
here.#!/bin/bash
#SBATCH --mem=10G
#SBATCH --ntasks-per-node=1
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load MACS2
macs2 callpeak -t atac_nodups.bam -f BAM -n atac -g mm -B --SPMR --nomodel --nolambda
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch peak_calling.sh
NOTE: Even with a subset of data, this job can take a while to run. We will let this run and return to the lecture
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
If your last step (peak calling) did not finish yet, you can link to the results below:
ln -s /workshops/ric_workshop/common/3_NGS/atac_peaks.narrowPeak
ln -s /workshops/ric_workshop/common/3_NGS/atac_summits.bed
ln -s /workshops/ric_workshop/common/3_NGS/atac_treat_pileup.bdg
head atac_peaks.narrowPeak
1 3820136 3820336 atac_peak_1 31 . 4.25793 5.39176 3.14871 70
1 4855527 4856425 atac_peak_2 223 . 12.7738 25.5049 22.368 339
1 4877600 4878435 atac_peak_3 310 . 15.9672 34.4755 31.0459 402
1 4927683 4929170 atac_peak_4 325 . 16.4995 36.0233 32.5467 195
1 5153075 5153839 atac_peak_5 169 . 10.6448 19.871 16.9386 453
1 5505186 5505419 atac_peak_6 61 . 5.85466 8.56696 6.12342 130
1 5893249 5893470 atac_peak_7 50 . 5.32241 7.46641 5.08531 101
1 6284396 6285693 atac_peak_8 251 . 13.8383 28.4312 25.1955 364
1 6294317 6294587 atac_peak_9 83 . 6.91914 10.8771 8.31705 140
1 6332640 6332895 atac_peak_10 50 . 5.32241 7.46641 5.08531 67
head atac_summits.bed
1 3820206 3820207 atac_peak_1 3.14871
1 4855866 4855867 atac_peak_2 22.368
1 4878002 4878003 atac_peak_3 31.0459
1 4927878 4927879 atac_peak_4 32.5467
1 5153528 5153529 atac_peak_5 16.9386
1 5505316 5505317 atac_peak_6 6.12342
1 5893350 5893351 atac_peak_7 5.08531
1 6284760 6284761 atac_peak_8 25.1955
1 6294457 6294458 atac_peak_9 8.31705
1 6332707 6332708 atac_peak_10 5.08531
head atac_treat_pileup.bdg
1 0 3050248 0.00000
1 3050248 3050416 0.12170
1 3050416 3050448 0.24339
1 3050448 3050616 0.12170
1 3050616 3051065 0.00000
1 3051065 3051089 0.12170
1 3051089 3051163 0.24339
1 3051163 3051216 0.36509
1 3051216 3051265 0.48678
1 3051265 3051279 0.36509
For this exercise, we will just pipe the output to head to get a peek
at the results. We have included the -name option to have
bedtools report the name of the peaks in the bed file with the FASTA
output. To save the getfasta output to a file you can use
the -fo option of bedtools sort to specify an output file
or redirect STDOUT to a file.
module load BEDTools
bedtools getfasta -name -fi mm39/mm39.fa -bed atac_peaks.narrowPeak | head
>atac_peak_1::1:3820136-3820336
CCTGAAAAAATGTTCAACATCCTTAATCATCAGGGAAATGCAAATCAAAACAACCCTGAGATTCCACCTCACACCAGTCAGAATGGCTAAGATCAAAAATTCAGGTGACAGCAGATGCTGGCGTGGATGTGGAGAAAGAGGAACACTCCTCCATTGTTGGTGGGATTGCAGGCTTGTACAACCACTCTGGAAATCAGTCT
>atac_peak_2::1:4855527-4856425
CAGTTTTCTAGTTGCACCCTGGGTTGCACATTGGCCAGCTGGAATGTAGAAAAAACACTGGGATTTTTATAACGACGATAGGCACTTGCTAAGCAGTTCAGTTTAAACGCAATTTTGGATGAATCGTCTGCAGTTCCCGAGTGTGTACCCTCCGCCTGCTTAGCATCCTTAACACTGGGCGCACGCGGAAGGGAGCCAGGCGCCCGCACGCGTCCATCGCCAGAGTGACGCGGCCCCTGCACTCCCCGCACCGCCGCCAAGCGTTTACCCGTTTTCTGGAGTTAGGACTGGGCTTCAGGTTGGCCAGGCTCACTCTCGGCAAGGACCGCAGCAGGTCCAGGCTGGTCCCGCAGCCGCGCGCCGTGCCAGCCATGACCCGCGGTGCTTCAAGGGCGCCCAGGAAGCGGCGTAGACTCCGCGTTGCTTTACGGCTCCGCCCGGCCCCTGTCGCGCTGCAGCGGCCGTACTGCCCCGCCCCTGTCGTGCTGCAGAGCGACCGCATCCGCGTCGGTAGGCTATGTCTTCCAGCTTCCAGGCTTGATGTCTTTTTACTTCGGGGAACTGCATCAACTTTTGTGAGATCATGCTTTATCCTCTGGCTTTGGTACCATTTTGTGAACGAAAACTGTTCTCTTCTGCCCTGTAATGACTGAACACGCTGTACGGTAAAAGTGGGGCCTCATGAAGCGAAAAGTAATCCAAAATGATGTGAACAGGTTCAAGCTTTGAGACTCTGGTAATTCAACTGTTACAGTGCCTGGCTCTAGAATTAATTGTCTAATTAGGTTCAACCTAATAGCGTGCCTTGCAGTTAACGATGGGGGGTGGGGCATACTTAGCTCCTGGGGTATGCCAGAAACTGAGTTAAAGTTAGCCTTACTAGATTCATATAAACAAC
>atac_peak_3::1:4877600-4878435
GGTCATGACATTTTGACCTTAAATGTTTGCCATGAACCGTTCCCCAGTTATACATACTATATATGCAATAAATTCAGCGCTGTGCAGAAGGGTGGCGATTTTGCCCCGTGTTCAAAACGGCTATATTTCAATGAATACCTGTCCTTTAGTTAATAACGTTGACCTGCCTGGGGAAGTGAAATCGGGAGTCACGCGTCAAGCGCGCAGGGGACCACAGCAAGAGCTCGCCCCGCCTAGCCGCGGTCCCAGCAGCGGTCCCTGCTACCTGCCGCTGGTCCAGGCACTGCGGGCGTGGATGGACAGCGAGAGGTTAGCAGCGGCCCGCACTGCTCCAGCGCGGTGCGGGAAGGCTCACTATGACGCGCACGCGCGGCCGAATCGGGGCGCGAGCTCGGGGCCGAACGCGAGGAACAGTGCAGGGCGGCGCGGGGGCGCGCACGCGCCTGAGCGCGCGCCCGGAGGGGCGGGCTGGGACTTTCGGCTGCCGGGAGCCCGAGTTCCCTTCCGCTTCCGACGCACTGTCCGCCAGCCGGTGGATGTGCGGCAACAACATGTCCGCTCCGATGCCCGCCGTTGTGCCGGCCGCCCGGAAGGCCACCGCCGCGGTGAGTGGCGGCGCCCGGGCCTGGGGCACCAGCGAGACCCAGGCCGCCGCGCGTGGATGCGTCCGCGCCGCCCCGCGGGCTCTCCGGGGTCGGGCGGGGAATCGGAGCCGGCGGGGCGCCGCCTGCCTGCCTAGCACTCTCACCCCCACCCCTGAAGCAAATTCCGAGGTCCAGTGCTGAAACAGCCGACCTGCAGCGGGAGTGTCTGTCCTGCCGCAGGAGCTCAGATT
>atac_peak_4::1:4927683-4929170
TCATGCGCTTTAAGCCCTCGGCAATGCCTGTCCTGCGTCCCAGAGAACGCTCTGCCGGAGGGGTTTCGATGGAACTCGTAGCAACCTACCGCCTACTGCCTGATCCCTCTGGCGTGAAAGCCGGACTCCGTCCAACTCCAGCTCGCCAGCAACGCGAGTCCGGATAGGGCCGGAAGTTCCAGACTGCTGGGGGCGGGAATAGATTAAAAGAACAGCGCACGCCTGAGCGAGCCACTTCTACTTTCCAGTCTCCTGCGATCGATTCGTAGTGCCTGTGTGGCGCGCGCGGTGCTTGACTGGCACGGCCTAGGGGGCGGAGCGCTTATCCCTGCCGCCGCGGGCCGGGTCTGTGAGGAAGGCCTAGGCCAGCGGCTTCGCGGCTTGTCCAACGTCCGCGCAGCTTCTCTCGCTTCCCGTCCGCTGTGGCCGGGCGAGGGGAGGCCTTCCGGAGCCATGGAGGACGAGGTGGTTCGCATTGCCAAGAAGATGGACAAAATGGTGCAGAAAAAGAATGCGGTAAGCGGGCGGGCCCGGGGCAGGCCGCGGCCGAGTCCGCGACACCGGCCCACGCGGAGTCAGTCAGGGCCGGGCCGGGCCGGGCCGCGCCGCGCGCCCTCCGGGCTTTCAGGTCCGTGCTGCCCACGGCCGCAGTGAGCCCAGGACTGAAGCCCTCGACGCTGGGCTGAGAGAATGTACGGTGGCCGCCGCCGGTTACCCCAGAGCGGCTTCGCTGGGTGCCGCGGTCAGTTGCAGACGGGCCGTAAAGTGTGTTTATCCCAGCTCTGCCTTACACCCACTTGGTAATCACAAACAGAAAATACGCCGCCCCTTGGCAAGTCGGCCTGCGTGCGATGTGCCGGGCAGCCGGCTGCTGCTGCTCCGCGGGTATGGTGCTGAGGCTGCAGCGGGACGGATGACATCCTCTGTTCCTTGATTGGTCTGTGTGGATGTGTCCCTCGGTGTGTACCACTTTCCATCCCTTTGCTTCCACACTTGACCTCTCCAGCCAGCTTCACTTGAATCTCTTCTTCCAGACGCAGCAGGTTTGCATTATCAAAACACCAGAGACTTTTGCCATTTGTGAAGTAAAAGCCGTGCAGAGATGCTAAACGTTTGTGCAGTGAGTGCTTTAGAGTTACGCGCCTCTGCAGAATTCTTTTTATGAGTCAAGAAACTGAGTCAGAGCTAAAGTGCTGTCGCTTAAATACTGTGTAGTTTTGCTTGAGCTAGGAGCTCGCTGACTTTCCACTCGCTGCCAGCTTAGGATTTAGTGTTAATACCAAATAATGTAAGACTTTCAGACGCCCCCGGTTCCTCTTTCTTAAACAGGTGTGTTGACCCGCATTTAATTTCATGCCCCGTTAATCCGTATGTACGATTTTGGTCCTTTCAAATATTTCCTGTTTGTTTTCTGTTCCAGATATTAGCCTGTGTTGTTTTTCTTTCTTACTGATTGAGCTATTTCATTACTTGATGCAGGTGATGAG
>atac_peak_5::1:5153075-5153839
CTTAAAATCGGAGTATGAATGCCTGGCCCGCTGAGATGTTTAATTCAAACAAACCCAAACCACTCAAAGGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCGGCCGCGCCAGGGTCTGCGGAGCCGCAGGGCCTTCCTCCCCTCGGAGGGAAGCAGAGAGCGCGGGGGTGGGCAGGCCTGGGCCGGGCCGGGCCGGGCCGGGAGAGCCGCGGGGGTCGGGCCGCCCGGGAAGGCTGCGCTGTGCCTTTGGTGCCAGCCTTGCCCGGATGGAGCCGGTTTTACCTGGGACCCGCAGTCTGCGCCGACCTTGCTTGTTCAGGCAGCGAGTTCCCACAAAGAGAAATCTAAGTTCCACACTGACTAAGCACTTCGGCAACGCGGAAT
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
The bedgraph (.bdg) output from Macs2 is the normalized enrichment across the genome. It can be useful for visualization in a genome browser, but we should make it into a bigWig file first.
cp ~/slurm_template.sh bdgtobw.sh
nano bdgtobw.sh) the SLURM script to look like
the following:#!/bin/bash
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load BEDTools
bedtools sort atac_treat_pileup.bdg atac_treat_sorted.bdg
module load ucsc
bedGraphToBigWig atac_treat_sorted.bdg mm39/mm39.fa.fai atac_treat_pileup.bw
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch bdgtobw.sh
After it finishes, compare file sizes:
ls -lh atac_treat_pileup.*
-rw-r--r-- 1 mmaiensc domain users 489M Jun 27 16:00 atac_treat_pileup.bdg
-rw-r--r-- 1 mmaiensc domain users 80M Jun 27 16:49 atac_treat_pileup.bw
If your last step (peak calling) did not finish yet, you can link to the results below:
ln -s /workshops/ric_workshop/common/3_NGS/atac_treat_pileup.bw
For this exercise we are going to look at the files generated by MACS2.
Once you have download the files open IGV.
Set the genome as mm39.
atac_nodups.bam,
atac_peaks.narrowPeak,atac_summits.bedandatac_treat_pileup.bw`.
(exercise continues on next page)
You have a large number of options for customizing the view:
EXAMPLE: You can also view SNPs and other variants if you zoom in
EXAMPLE For RNA-seq data mapped in a splice-aware manner (STAR or TopHat), we can also see the splice junctions
You can save views from IGV to include in manuscripts, presentations, posters, etc.:
You can save a particular view in IGV as a session. This is very useful if you’ve loaded a lot of data and configured the tracks in a specific way, and you want to save this configuration to return to later.
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
The samtools faidx command allow one to quickly extract
regions from a FASTA file. As part of this command, samtools will
generate a FASTA index (.fai) file.
samtools faidx mm39.fahead mm39/mm39.fa.fai
1 195154279 56 60 61
2 181755017 198406963 60 61
3 159745316 383191287 60 61
4 156860686 545599081 60 61
5 151758149 705074168 60 61
6 149588044 859361676 60 61
7 144995196 1011442911 60 61
8 130127694 1158854750 60 61
9 124359700 1291151295 60 61
10 130530862 1417583715 60 61
Let’s get the genomic coordinates that correspond to the first peak on chr12.
egrep "^12" atac_peaks.narrowPeak | head -1
12 3159733 3160238 atac_peak_8952 3440 . 97.9324 349.048 344.032 198
module load SAMtools
samtools faidx mm39/mm39.fa 12:3159733-3160238
>12:3159733-3160238
TTCTCTCCTCTCCCAAGAAAGGCTGGTGTCCTTATAAGGGAGACAGCCTCCATTTCCAAC
AGTGTGGAGAGCTAAATAAATACTCTCTCACAGTCAATGCAGAGGTAACTTTTTTTTTTT
TTTTTGATGGATTAAAATCACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACG
TGAAATATGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTGTAGGACA
TGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGACGACTTG
AAAAATGACAAAATCACTGAAAAAGGTGAAAAATGAGAAATGCACACTGTAGGACTTGGA
ATATGGCGAGAAAACTGAAAATCACGGAAGTGAGGGGCTTTTGTCTCTCTCTCAGTGGGA
GCTGCCATTTTAATGGGACCTCCAGAGGGGCCTGATTTGATGATGTGAAGTAGGCCTGGA
AGAATCTAGGATTTTTTTTTAAATTT
If your SSH session closed and you had to login again, be sure
that you are in your ngs directory
cd ~/ngs
For this exercise we will be using a BCF file prepared using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA750802) In particular this BCF file is from one sample (SRA ID: SRR15301098) and after variant calling, the BCF file was subsetted to only include data from genes in chromosome 19.
NOTE: As this is a small file we can run some bcftools commands on the head node. For real data you should perform all of the work on a compute node, either through a SLURM batch script or interactive session.
ln -s /workshops/ric_workshop/common/3_NGS/variants.bcf
module load BCFtools
bcftools plugin counts variants.bcf
Number of samples: 1
Number of SNPs: 24364
Number of INDELs: 2272
Number of MNPs: 1445
Number of others: 403
Number of sites: 28389
-i option lists the filtering criteria. We are
going to filter on depth (DP) and use the data for each
ssample (FORMAT)-O option sets the output type to BCF
(b).-o options specifies the output file. The default
action is to print to STDOUT.bcftools view -i 'QUAL>20 && FORMAT/DP>10' -Ob -o variants_subset.bcf variants.bcf
bcftools plugin counts variants_subset.bcf
Number of samples: 1
Number of SNPs: 3007
Number of INDELs: 333
Number of MNPs: 210
Number of others: 58
Number of sites: 3599
-H flag tells bcftools query to print
a header-f option lists the fields to print. We will print
the chromosome (CHROM), position (POS) and genotype (GT) for each
sample. Per sample format fields, e.g. GT, need to be enclosed in square
brackets to loop over all samples.bcftools query -H -f'%CHROM\t%POS\t[%GT]' variants_subset.bcf | head
#[1]CHROM [2]POS [3]GT
19 4128293 0/1
19 4128312 0/1
19 5798782 0/1
19 6104861 0/1
19 6105124 0/1
19 6105217 0/1
19 6107748 0/1
19 6107755 0/1
19 6790300 0/1
bcftools convert -Oz -o variants.vcf.gz variants.bcf
The following is an example of using the VCF/BCF FILTER field to tag variants as filtered, which still retain the data in the file, and then view/query the variants that pass the filter. Using the FILTER field can be helpful to set various filters but, still be able to see the variants that did not meet the critera. Furthermore, you can apply a set of different filters one at a time and then do a final subsetting of just the variants that PASS.
filter command using the same criteria as step
4 in the previous steps. For this we will run in two steps. The first
will filter based on the QUAL field for the variant and then based on
depth.
-s option sets it as a soft filter and sets the
FILTER field for any variant to be “QUAL_20” if it fails to meet the
first criteria (-i option) and then “DP_10” if fails to
meet the second criteria.bcftools filter -i 'QUAL>20' -s "QUAL_20" -Ou variants.bcf | \
bcftools filter -i 'FORMAT/DP>10' -s "DP_10" -Ob -o variants_filtered.bcf -
bcftools plugin counts variants.bcf
Number of samples: 1
Number of SNPs: 24364
Number of INDELs: 2272
Number of MNPs: 1445
Number of others: 403
Number of sites: 28389
bcftools plugin counts variants_filtered.bcf
Number of samples: 1
Number of SNPs: 24364
Number of INDELs: 2272
Number of MNPs: 1445
Number of others: 403
Number of sites: 28389
bcftools query command to print details
about the variants. We can see that the top variants have a depth less
than 10. Thus the FILTER field is set to “DP_10”bcftools query -H -f '%CHROM\t%POS\t[%GT\t%DP]\t%FILTER' variants_filtered.bcf | head
#[1]CHROM [2]POS [3]GT [4]DP [5]FILTER
19 3096362 0/1 3 DP_10
19 3096637 1/1 2 DP_10
19 3096652 1/1 2 DP_10
19 3096658 1/1 2 DP_10
19 3096695 1/1 2 DP_10
19 3122217 0/0 3 DP_10
19 3122222 0/0 3 DP_10
19 3152171 0/1 3 DP_10
19 3152187 1/1 2 DP_10
bcftools view to subset the varaints based on the
filter status and then pipe to bcftools plugin counts to
see the filtered results.bcftools view -f .,PASS variants_filtered.bcf | bcftools plugin counts
Number of samples: 1
Number of SNPs: 3007
Number of INDELs: 333
Number of MNPs: 210
Number of others: 58
Number of sites: 3599
bcftools view command but, pipe to
bcftools query to print the details.bcftools view -f .,PASS variants_filtered.bcf | \
bcftools query -H -f '%CHROM\t%POS\t[%GT\t%DP]\t%FILTER' - | head
#[1]CHROM [2]POS [3]GT [4]DP [5]FILTER
19 4128293 0/1 13 PASS
19 4128312 0/1 14 PASS
19 5798782 0/1 17 PASS
19 6104861 0/1 29 PASS
19 6105124 0/1 152 PASS
19 6105217 0/1 101 PASS
19 6107748 0/1 39 PASS
19 6107755 0/1 40 PASS
19 6790300 0/1 20 PASS
Download the VCF file (variants.vcf.gz) to your
computer using WinSCP or Cyberduck.
Open IGV and select mm39 as the genome. This was the genome used to create the original BCF file.
Load the compressed VCF file (variants.vcf.gz)
NOTE: The original BCF was created only using data for chromosome 19. So, you would only see variants in that chromosome.
This exercise will use BEDTools to generate a set of merged peaks from four different samples and then quantitate the read coverage for each peak in each sample.
For this example, we have aligned data from four different mouse ATAC-seq samples, sorted, removed PCR duplicates and then subsetted to only include alignments from chromosome 19.
curl -OJL https://wd.cri.uic.edu/NGS/atac_bams.zip
unzip atac_bams.zip
nano peakcall_multi.sh) to run
peak calling (MACS2) on the downloaded BAM files.NOTE: This script will run peak calling on all BAM
files that end with _chr19.bam. So, be careful if you have
other BAM files in this directory. Furthermore, we have adjusted the
genome size to just be the size of chromosome 19 in mm39 as the BAM
files have been subsetted to just chromosome 19.
For each BAM file, this script will set the variable
$sample to the value of the name of BAM file without the
“.bam” at the end.
#!/bin/sh
#SBATCH --mem=10G
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load MACS2
for i in *_chr19.bam; do
sample=$(basename $i .bam)
macs2 callpeak -t $i -f BAM -n ${sample} -g 61420004 -B --SPMR --nomodel --nolambda
done
sbatch peakcall_multi.sh)Once the peak calling is complete you should see a set of files for each sample.
wc -l *.narrowPeak
15 A.1_chr19_peaks.narrowPeak
20 A.2_chr19_peaks.narrowPeak
708 B.1_chr19_peaks.narrowPeak
779 B.2_chr19_peaks.narrowPeak
1522 total
nano quant_atac.sh) to merge the
peaks and then requantitate for each sample.
merged_peaks.bed
cat) all of the .narrowPeak
files into one BED filecut command will just report the first three
columns, i.e. chromosome, start and end positions.bedtools sort command will sort the combined BED
data by chromosome and start positionbedtools merge will merge overlapping regions
listed in the BED dataawk command will add an extra column to the merged
output. This column will have the string “atac_peak_” followed by the
line number in the output. This will give a nice name to all of the
merged peaks.files= will create
a header line for the counts table.
files as an array.\t). This way when we print out the array files it will
automatically use a tab to separate the values.bedtools multicov to compute the read coverage in
the list of BAM files for each region in the supplied BED file.#!/bin/sh
#SBATCH --mem=10G
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load BEDTools
cat *.narrowPeak | cut -f1-3 \
| bedtools sort | bedtools merge \
| awk -F "\t" 'BEGIN { OFS="\t" } { print $0, "atac_peak_" NR }' > merged_peaks.bed
files=(*_chr19.bam)
IFS="\t"
echo -e "CHROMO\tSTART\tEND\tNAME\t${files}" > atac_quant.txt
bedtools multicov -bams *_chr19.bam -bed merged_peaks.bed >> atac_quant.txt
Submit the job to the queue
(sbatch quant_atac.sh)
After the job is complete you can check the output file
(atac_quant.txt)
head atac_quant.txt
CHROMO START END NAME A.1_chr19.bam
19 3373242 3373467 atac_peak_1 3 0 2 13
19 3384176 3384376 atac_peak_2 0 2 0 6
19 3410001 3410201 atac_peak_3 0 0 0 5
19 3438700 3438900 atac_peak_4 0 1 0 4
19 3439226 3439440 atac_peak_5 0 0 8 8
19 3473089 3473382 atac_peak_6 0 0 10 4
19 3480788 3481103 atac_peak_7 0 1 10 12
19 3501776 3502147 atac_peak_8 0 2 10 12
19 3625743 3626379 atac_peak_9 4 18 34 36
You can then perform differential analysis on the counts in this file using edgeR. Note, when you read in the table to R for analysis with edgeR you would want to strip off the first three columns so that that the feature ID would be the peak names given. After analysis you can then merge back the first three columns, i.e. chromosome, start and end, to get a sense of the genomic locations for each of the peaks.
atac_table <- read.delim("atac_quant.txt")
atac_data <- atac_table[,-(1:3)]
atac_peaks <- atac_table[,c(4,1:3)]
#
# Run differential analysis using edgeR....
# For this we are skipping the lines you would run for a normal edgeR analysis
# and jumping ahead to when one gets the diff table.
#
diff = stats$table
# Merge back the details of the peaks, i.e. chromosome, start and end
diff <- merge(atac_peaks, diff, by.x=1, by.y=0)
# Write to a file
write.table(diff, "diff.txt", sep="\t", quote=F
The following are example exercises using the UCSC genome browser.
For this example, We’ll take a look at the TP53 locus in hg38 (most recent human reference genome version).
Download a gene annotation file for hg38:
Table browser:
Setting filters:
The liftOver tools allows us to convert coordinates between assemblies of the same organism (e.g., mm39 to mm9), or even between organisms based on homology (e.g., mm39 to hg38). We can try it on our ATAC-seq peaks.
NOTE For this exercise we have removed the parition/queue and accounting details from the SLURM scripts. For a short time after the end of the workshop you should still be able to use the ric_worshop accouting code with the batch partition. However, if you plan to run these exercise later be sure to set the proper partition and accounting code for your HPC account.
For this example, we will be using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA750802) These data were obtained from mouse samples. So, we will be using the mm39 reference in this example.
As a test case, you can use the following SLURM script to download the data for one of the samples (SRA ID: SRR15301098) and create a subset of the data. This script will subset the first 1 million read (4 million FASTQ lines).
#!/bin/sh
#SBATCH --time=10:00:00
#SBATCH --mem=10G
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load SRA-Toolkit
fasterq-dump SRR15301098 -O raw_data -p --split-3
for a_file in raw_data/*.fastq ; do
sample=$(basename $a_file .fastq)
echo "$sample"
head -n 4000000 $a_file | gzip -c > ${sample}.fastq.gz
done
For the genome alignment we can use BWA-MEM and thus can use the same genome index used in the Genome Alignment exercise (1.x). In the alignment step, we will also sort, remove any PCR duplicates and index the resulting BAM file.
nano mapping_vc.sh) that looks
like the following.#!/bin/sh
#SBATCH --mem=10G
#SBATCH --ntasks-per-node=4
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load BWA
module load SAMtools
bwa mem -t $SLURM_TASKS_PER_NODE mm39/mm39.fa SRR15301098_1.fastq.gz SRR15301098_2.fastq.gz \
| samtools view -b - > var_calling.bam
samtools sort var_calling.bam > var_calling_sort.bam
samtools rmdup var_calling_sort.bam var_calling_nodups.bam
samtools index var_calling_nodups.bam
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch mapping_vc.sh
We will use freebayes (https://github.com/freebayes/freebayes) to call the variants from the alignment files. We will also use bcftools to convert the output from freebayes (VCF) to a compressed BCF file
nano freebayes.sh) that looks
like the following.#!/bin/sh
#SBATCH --mem=10G
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load freebayes
module load BCFtools
freebayes --fasta-reference mm39/mm39.fa var_calling_nodups.bam \
| bcftools -o variants.bcf -O b -
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch freebayes.sh
bcftools has the ability to call variants from an alignment file. This may not be a sophisticated or robust as freebayes. However, depending on the goal of your project it may work for you.
nano bcftools_call.sh) that
looks like the following.#!/bin/sh
#SBATCH --mem=10G
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load BCFtools
bcftools mpileup -Ou -f mm39/mm39.fa var_calling_chr19.bam | \
bcftools call -Ob -mv > variants.bcf
Remember: Ctrl+O to save, Ctrl+X to exit from nano.
sbatch bcftools_call.sh
NOTE For this exercise we have removed the parition/queue and accounting details from the SLURM scripts. For a short time after the end of the workshop you should still be able to use the ric_worshop accouting code with the batch partition. However, if you plan to run these exercise later be sure to set the proper partition and accounting code for your HPC account.
For this example, we will be using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1252737) These data were obtained from mouse samples. So, we will be using the mm39 reference in this example.
As a test case, you can use the following SLURM script to download the data for one of the samples (SRA ID: SRR33213504) and create a subset of the data. This script will subset the first 1 million read (4 million FASTQ lines).
#!/bin/sh
#SBATCH --time=10:00:00
#SBATCH --mem=10G
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load SRA-Toolkit
fasterq-dump SRR33213504 -O raw_data -p --split-3
for a_file in raw_data/*.fastq ; do
sample=$(basename $a_file .fastq)
echo "$sample"
head -n 4000000 $a_file | gzip -c > ${sample}.fastq.gz
done
Like other alignment tools, Bismark needs its own index and this will only need to be run once per reference.
mkdir bismark_ref_mm39
cd bismark_ref_mm39
curl https://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz | zcat > mm39.fa
nano index.sh) that looks like
the following.#!/bin/sh
#SBATCH --time=100:00:00
#SBATCH --mem=30G
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load Bismark
bismark_genome_preparation bismark_ref_mm39
sbatch index.sh
nano run_bismark.sh) that looks
like the following.
bismark command will run the alignment and
generate a BAM file.bismark_methylation_extractor command will extract
a methylation report from the BAM file produced by Bismark.--comprehensive option will have the command merge
strand specific methylation data for each context, e.g. CpG, CHG, CHH,
into a single file for each context.--gzip will compress the output using gzip. The
table can be very large and it is recommended to have it compressed. It
is possible to gzipped table directly into R.#!/bin/sh
#SBATCH --time=100:00:00
#SBATCH --mem=30G
#SBATCH --ntasks-per-node=4
#SBATCH -o %x.o%J
cd $SLURM_SUBMIT_DIR
module load Bismark
bismark -p 4 --genome bismark_ref_mm39 -1 SRR33213504_1.fastq.gz -2 SRR33213504_2.fastq.gz
bismark_methylation_extractor --gzip --comprehensive -p SRR33213504_1_bismark_bt2_pe.bam
sbatch run_bismark.sh
The output from bismark_methylation_extractor is a per
sequence report with the following columns. seq-ID, methylation state,
chromosome, start position (= end position), methylation call. You can
use the following R code to summarize the report to create counts per
genomic loci.
library(dplyr)
me_full_report <- read.delim(gzfile("CpG_context_SRR33213504_1_bismark_bt2_pe.txt.gz"), skip=1,
col.names=c("sequence", "me_state", "chromosome", "posn", "me_call"))
me_counts <- subset(me_full_report, me_state == "+") %>%
group_by(chromosome, posn) %>%
summarize(me_count=n())
all_counts <- me_full_report %>%
group_by(chromosome, posn) %>%
summarize(all_count=n())
me_report <- full_join(all_counts, me_counts, by=join_by(chromosome, posn)) %>%
mutate(me_count=coalesce(me_count, 0)) %>%
mutate(me_fraction=me_count / all_count)
After creating the me_report data.frame you may want to filter the loci to only retain those sites with a minimum level of coverage.